Birthweight Prediction

Final Project
Data Science 2 with R (STAT 301-2)

Author

Cassie Lee

Published

March 4, 2024

Introduction

The developmental origins of health and disease (DOHaD) is a framework that seeks to link exposures to stressors during windows of susceptibility to the development of non-communicable diseases later in life (Lacagnina 2019). Birthweight has historically been used as an indicator of adverse early life exposures to stressors due to the relative ease of collecting birthweight data (Hanson 2015). Low birthweight is a risk factor for developing non-communicable diseases later in life such as cardiovascular diseases, metabolic diseases, and kidney diseases (Bianchi and Restrepo 2022).

Lacagnina, Salvatore. 2019. “The Developmental Origins of Health and Disease (DOHaD).” American Journal of Lifestyle Medicine 14 (1): 47–50. https://doi.org/10.1177/1559827619879694.
Hanson, M. 2015. “The Birth and Future Health of DOHaD.” Journal of Developmental Origins of Health and Disease 6 (5): 434–37. https://doi.org/10.1017/S2040174415001129.
Bianchi, Maria Eugenia, and Jaime M. Restrepo. 2022. “Low Birthweight as a Risk Factor for Non-Communicable Diseases in Adults.” Frontiers in Medicine 8. https://doi.org/10.3389/fmed.2021.793990.

Objective

The objective of this prediction problem is to create a model that predicts the birthweight of an infant given a set of characteristics. This is a regression problem. A prediction model for this problem can also provide insights that help develop a later inferential question about which factors are most strongly associated with birthweight.

A classification problem predicting whether an infant would be born with low birthweight was not feasible with this dataset because of high class imbalance. Only about 10% of observations in the dataset would have been classified as low birthweight, defined as a birthweight under 2,500 grams (Hughes, Black, and Katz 2017).

Hughes, Michelle M., Robert E. Black, and Joanne Katz. 2017. “2500-g Low Birth Weight Cutoff: History and Implications for Future Research and Policy.” Maternal and Child Health Journal 21 (2): 283–89. https://doi.org/10.1007/s10995-016-2131-9.

Data Source

I downloaded the Kaggle dataset, US births (2018) (Amol Deshmukh 2020). The creator of this dataset subsetted this data from the raw 2018 natality file available from the Vital Statistics Online Data Portal (National Center for Health Statistics 2023).

Amol Deshmukh. 2020. “US Births (2018).” https://www.kaggle.com/datasets/des137/us-births-2018.
National Center for Health Statistics. 2023. “Vital Statistics Online Data Portal.” https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm.
I chose this dataset after searching Kaggle for data relating to birthweight and finding the Kaggle competition Prediction interval competition I: Birthweight by Carl McBride Ellis. Ellis cited the US births (2018) dataset as the competition's source. I downloaded the US births (2018) dataset to have access to the full dataset and more control over the number of observations I used for exploratory data analysis, the training data, and the testing data.

    The prediction interval competition identified 39 predictors to use. Due to computational limitations, I selected 20 variables from this list, keeping only variables describing conditions of the pregnancy (for example, weight gain during pregnancy) and dropping variables describing conditions of the birth (for example, the type of facility in which the birth occurred).

    Data Overview

    The US births (2018) dataset has over 3 million observations. Due to computational limitations, I subsetted 30,000 observations, using 15,000 for an exploratory data analysis and 15,000 for model fitting and testing.

    Because I could subset from over 3 million observations, I was able to filter out incomplete observations before randomly sampling a total of 30,000 observations. Therefore, I do not have any major missingness issues. Table 1 shows the results of the missingness analysis. This analysis covers the observations used in exploratory data analysis, model fitting and tuning, and testing. Variables are ordered from the greatest to least number of missing observations, with only the first five variables shown.

    Although there are no major missingness issues, Table 1 shows that interval since last pregnancy (ilp_r), interval since last birth (illb_r), and the month of pregnancy in which the mother began prenatal care (precare) each have a few hundred missing values. I created the missingness in these variables myself because the original data collection encoded multiple types of information under the same variable.

    For the interval since last pregnancy and the interval since last birth, if the mother had a plural delivery, the interval since the last birth was encoded as the number of infants (0 to 3) the mother had at the time of birth. From this information, I created a logical variable identifying whether the birth was a plural delivery and recoded the interval since last birth and last pregnancy as missing. Similarly, if this was the mother’s first birth or first pregnancy, the interval since last birth and last pregnancy was encoded as 888. From this information, I created two logical variables identifying whether this was the mother’s first birth or first pregnancy and recoded the interval since last birth and last pregnancy as missing.

    For the month of pregnancy in which the mother began prenatal care, 00 was recorded if the mother did not receive any prenatal care. From this information, I created a logical variable identifying whether the mother received prenatal care and recoded the month in which prenatal care began as missing.

    I could not filter out these observations because of the information they carry, even though their values are recorded as “missing”. All other variables are complete.
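The recoding described above can be sketched in base R. This is a minimal illustration on made-up rows, not the project's actual code: the column names `ilp_r`, `illb_r`, and `precare` come from Table 1, while the flag names (`plural_del`, `first_birth`, `first_preg`, `received_precare`) and the helper `recode_births()` are hypothetical.

```r
# Made-up rows; sentinel codes follow the description in the text:
# 0 to 3 = plural delivery, 888 = first birth / first pregnancy, 0 = no prenatal care
raw <- data.frame(
  illb_r  = c(24, 888, 2, 60),
  ilp_r   = c(18, 888, 2, 48),
  precare = c(3, 2, 0, 5)
)

# Hypothetical helper: keep the information in logical flags, then blank the codes
recode_births <- function(df) {
  df$plural_del       <- df$illb_r %in% 0:3
  df$first_birth      <- df$illb_r == 888
  df$first_preg       <- df$ilp_r == 888
  df$received_precare <- df$precare != 0

  # Sentinel codes are not real intervals or months, so they become NA
  df$illb_r[df$plural_del | df$first_birth] <- NA
  df$ilp_r[df$plural_del | df$first_preg]   <- NA
  df$precare[!df$received_precare]          <- NA
  df
}

recoded <- recode_births(raw)
recoded
```

Keeping the flags alongside the blanked values is what allows these rows to stay in the sample even though they now count as incomplete.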

    Table 1: Missingness analysis conducted on 30,000 randomly sampled observations.

    skim_variable   n_missing   complete_rate
    ilp_r                 368       0.9877333
    illb_r                367       0.9877667
    precare               308       0.9897333
    feduc                   0       1.0000000
    meduc                   0       1.0000000
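The column names in Table 1 resemble the output of the skimr package, but the same three columns can be computed in base R. The sketch below assumes a toy data frame with made-up values; `missing_summary()` is a hypothetical helper, not a function from any package.

```r
# Toy data standing in for the sampled births (values are made up)
births <- data.frame(
  ilp_r   = c(12, NA, 30, NA),
  precare = c(2, 3, NA, 1),
  meduc   = c(3, 4, 2, 6)
)

# Hypothetical helper: count missing values per column and sort most-missing first
missing_summary <- function(df) {
  n_missing <- colSums(is.na(df))
  out <- data.frame(
    skim_variable = names(n_missing),
    n_missing     = as.integer(n_missing),
    complete_rate = 1 - as.numeric(n_missing) / nrow(df)
  )
  out[order(-out$n_missing), ]
}

missing_summary(births)
```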

    A univariate analysis of birthweight across the 30,000 randomly sampled observations shows an approximately normal distribution, so no transformation is needed to use birthweight in this prediction problem. Figure 1 shows the distribution of birthweights, highlighting the outliers for babies that were either unusually light or unusually heavy. This distribution is as expected given the physiological constraints on both very small and very large infants.

    Figure 1: Target variable analysis conducted on 30,000 randomly sampled observations.

    I performed a short exploratory data analysis using the 15,000 observations set aside for this purpose. This analysis was used to inform both the base recipe and feature engineering in a second unique recipe.

    I conducted bivariate analyses between each predictor and birthweight. Only a few of the predictors showed interesting relationships with birthweight. Figures 2 through 7 show the relationships between birthweight and these predictors. Figures for the other relationships can be found in the appendix.

    Figure 2: Plural delivery
    Figure 3: Daily cigarettes before pregnancy
    Figure 4: Mother’s height
    Figure 5: Mother’s weight before pregnancy
    Figure 6: Number of prenatal visits
    Figure 7: Weight gain during pregnancy

    I also explored correlations between numeric predictors to identify collinearity, shown in Figure 8. A few predictors are highly correlated with each other, such as the age of the mother and the age of the father, as well as the number of prior births who are still alive and the interval since the last birth and last pregnancy. I do not address these collinearities in the base recipe because the predictors are not perfectly collinear. In the second recipe, I do not include the age of the father or the number of prior births who are still alive, so these collinearities are not an issue.

    Figure 8: Correlation matrix between numeric predictors
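A correlation screen of this kind can be sketched in base R as follows. The data are simulated, the column names (`mage`, `fage`, `height`) are hypothetical, and the 0.7 cutoff is an assumption chosen for illustration; `high_cor_pairs()` is not a library function.

```r
set.seed(301)
n <- 500
mage <- rnorm(n, mean = 29, sd = 6)

# Simulated predictors; father's age is built to track mother's age
toy <- data.frame(
  mage   = mage,
  fage   = mage + rnorm(n, mean = 2, sd = 3),
  height = rnorm(n, mean = 64, sd = 3)
)

# Hypothetical helper: list each pair whose absolute correlation meets the cutoff
high_cor_pairs <- function(df, cutoff = 0.7) {
  cors <- cor(df, use = "pairwise.complete.obs")
  # upper.tri() keeps each pair once and drops the diagonal
  idx <- which(abs(cors) >= cutoff & upper.tri(cors), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cors)[idx[, "row"]],
    var2 = colnames(cors)[idx[, "col"]],
    r    = cors[idx]
  )
}

high_cor_pairs(toy)
```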

    Methods

    Data Splitting

    The 15,000 observations used for this prediction problem were split 75/25 between training and testing data, stratified by birthweight.

    I used V-fold cross-validation because I have a fairly large dataset and do not need the sampling with replacement used in bootstrapping. I resampled with 4 folds and 3 repeats, so each workflow was fit 12 times to produce a metric estimate and standard error. Because the folds partition the 11,250 training observations, each fit used about 8,440 observations to train the model and about 2,810 observations to estimate its performance. I used few folds and repeats because I was running into issues with computation time and computer battery usage.
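The split and resampling scheme can be sketched with the rsample package. This is an illustration under stated assumptions rather than the project's code: the data are simulated, the column name `birthweight` and the mean and standard deviation used to simulate it are made up, and the folds are assumed to be drawn from the training set.

```r
library(rsample)

set.seed(301)
# Simulated stand-in for the 15,000 modeling observations
births_model <- data.frame(
  birthweight = rnorm(15000, mean = 3260, sd = 590),
  mage        = rnorm(15000, mean = 29, sd = 6)
)

# 75/25 split, stratified by the outcome
births_split <- initial_split(births_model, prop = 0.75, strata = birthweight)
births_train <- training(births_split)
births_test  <- testing(births_split)

# 4 folds repeated 3 times gives 12 resamples
births_folds <- vfold_cv(births_train, v = 4, repeats = 3, strata = birthweight)
nrow(births_folds)
```

Stratifying on a numeric outcome bins it into quantile groups, so the training set size can differ from exactly 75% by a few rows.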

    Recipes

    I created two distinct recipes, one base recipe and one recipe where I focused on only the predictors that were identified in the exploratory data analysis as having an interesting relationship with birthweight.

    Base Recipe

    The base recipe is a kitchen sink recipe that keeps the variables as is. Interval since last pregnancy and interval since last birth have NA values for plural deliveries, and I changed these NA values to 0. For individuals who did not receive any prenatal care, I changed the NA values for the month in which prenatal care began to 10, a value later than any month of pregnancy, which essentially treats these mothers as starting care later than everyone else. Imputing these NA values did not make sense. I removed variables that were exact linear combinations of other variables and variables with zero variance. For linear models, I dummy coded all nominal predictors and centered and scaled all numeric predictors. For tree-based models, I one-hot encoded the nominal predictors.
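A recipe along these lines can be sketched with the recipes package. The toy data, the column names other than `ilp_r`, `illb_r`, and `precare`, and the ordering of the steps are all assumptions made for illustration; the project's actual recipe may differ.

```r
library(recipes)

# Toy training data (values are made up); `constant` exists only to show step_zv()
toy_train <- data.frame(
  birthweight = c(3200, 2900, 3500, 2400, 3100, 3300),
  ilp_r       = c(24, NA, 36, NA, 18, 60),
  illb_r      = c(30, NA, 40, NA, 20, 66),
  precare     = c(2, 3, NA, 1, 4, 2),
  sex         = factor(c("F", "M", "F", "M", "M", "F")),
  constant    = 1
)

base_recipe <- recipe(birthweight ~ ., data = toy_train) |>
  # Replace the deliberately created NAs with fixed values instead of imputing
  step_mutate(
    ilp_r   = ifelse(is.na(ilp_r), 0, ilp_r),
    illb_r  = ifelse(is.na(illb_r), 0, illb_r),
    precare = ifelse(is.na(precare), 10, precare)
  ) |>
  step_zv(all_predictors()) |>
  step_lincomb(all_numeric_predictors())

# Linear-model variant: reference-level dummies, then center and scale
linear_recipe <- base_recipe |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

# Tree-model variant: one column per factor level
tree_recipe <- base_recipe |>
  step_dummy(all_nominal_predictors(), one_hot = TRUE)

linear_baked <- bake(prep(linear_recipe), new_data = NULL)
tree_baked   <- bake(prep(tree_recipe), new_data = NULL)
names(tree_baked)
```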

    Second Recipe

    For the second recipe, I selected only the variables that showed interesting relationships with birthweight in the exploratory data analysis and variables that have been shown to be predictive of birthweight in the literature. These variables were whether or not the birth was a plural delivery, the number of cigarettes smoked daily before pregnancy, the mother’s height, the mother’s weight before pregnancy, the number of prenatal visits, the mother’s weight gain during pregnancy, whether the birth was the mother’s first birth or first pregnancy, whether the mother received any prenatal care, and the mother’s age. Within the recipe, I also operationalized prenatal care as whether or not care began in the first trimester and created a new variable identifying whether the mother is in a higher-risk age group (teenage or over 35) (Da Silva et al. 2003; Restrepo-Méndez et al. 2015). For linear models, I used interaction terms between starting prenatal care early and number of prenatal visits, between weight gain and pre-pregnancy weight, between plural delivery and weight gain, between cigarettes and weight gain, and between the age risk indicator and number of prenatal visits. As in the base recipe, I removed variables with exact linear dependencies and predictors lacking any variance, and made the appropriate adjustments between the versions of the recipe for linear and tree-based models.

    Da Silva, Antônio A. M., Vanda M. F. Simões, Marco A. Barbieri, Heloisa Bettiol, Fernando Lamy-Filho, Liberata C. Coimbra, and Maria T. S. S. B. Alves. 2003. “Young Maternal Age and Preterm Birth.” Paediatric and Perinatal Epidemiology 17 (4): 332–39. https://doi.org/10.1046/j.1365-3016.2003.00515.x.
    Restrepo-Méndez, María Clara, Debbie A. Lawlor, Bernardo L. Horta, Alicia Matijasevich, Iná S. Santos, Ana M. B. Menezes, Fernando C. Barros, and Cesar G. Victora. 2015. “The Association of Maternal Age with Birthweight and Gestational Age: A Cross-Cohort Comparison.” Paediatric and Perinatal Epidemiology 29 (1): 31–40. https://doi.org/10.1111/ppe.12162.
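The feature engineering in the second recipe can be sketched with the recipes package. The example covers only three of the five interactions, and every column name, the toy values, and the exact cutoffs (care starting by month 3, age under 20 or over 35) are assumptions made for illustration.

```r
library(recipes)

# Toy training data (values are made up); precare of 10 marks no prenatal care
toy_train <- data.frame(
  birthweight = c(3200, 2900, 3500, 2400, 3100, 3300, 2800, 3600),
  precare     = c(2, 5, 1, 7, 3, 2, 10, 4),
  previs      = c(12, 8, 14, 4, 10, 11, 0, 9),
  wtgain      = c(30, 22, 35, 15, 28, 31, 18, 40),
  pwgt        = c(140, 160, 130, 180, 150, 145, 170, 135),
  mage        = c(27, 17, 31, 38, 24, 29, 19, 36)
)

second_recipe <- recipe(birthweight ~ ., data = toy_train) |>
  step_mutate(
    early_care = as.numeric(precare <= 3),            # care began in the first trimester
    risk_age   = as.numeric(mage < 20 | mage > 35)    # teenage or over 35
  ) |>
  # Interaction columns are named with the default "_x_" separator
  step_interact(~ early_care:previs + wtgain:pwgt + risk_age:previs) |>
  step_zv(all_predictors())

# Linear-model variant adds centering and scaling
second_linear <- second_recipe |>
  step_normalize(all_numeric_predictors())

second_baked <- bake(prep(second_recipe), new_data = NULL)
names(second_baked)
```

For tree-based models the interaction step would typically be omitted, since trees can represent interactions through successive splits.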

    Model Types and Tuning Parameters

    Null

    Simple Linear Regression

    Elastic Net

    Nearest Neighbor

    Random Forest

    Boosted Tree

    Neural Net

    Selection Metric

    The metric I used to identify the best model is RMSE. I used RMSE over MAE partly because it is one of the default metrics for regression prediction problems, but also because I want larger errors to be weighted more heavily in the performance estimate. A very low or very high birthweight can lead to birth complications, so highly inaccurate predictions would fail to identify the risk of low or high birthweight, which can have significant health impacts.
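The difference between the two metrics can be seen in a small base R example. The birthweights and predictions are made up; the two sets of predictions have the same total absolute error, but one concentrates it in a single large miss.

```r
rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))
mae  <- function(truth, estimate) mean(abs(truth - estimate))

truth <- c(3200, 3400, 2900, 3100)               # made-up birthweights in grams
even_errors  <- truth + c(200, -200, 200, -200)  # four moderate misses
one_big_miss <- truth + c(0, 0, 0, 800)          # one large miss, same total error

mae(truth, even_errors)    # 200
mae(truth, one_big_miss)   # 200
rmse(truth, even_errors)   # 200
rmse(truth, one_big_miss)  # 400
```

MAE cannot tell the two sets of predictions apart, while RMSE doubles for the set containing the 800-gram miss.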

    Model Building and Selection

    Final Model Analysis

    Conclusion

    References

    Appendix: Exploratory Data Analysis